Common Information:
1. Linear regression is for numeric values.
2. It is based on the assumption that a linear relationship exists between the input and output variables.
3. The learning algorithm learns the set of parameters that minimizes the sum of squared errors, sum((y_actual - y_estimate)^2).
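The least-squares idea in point 3 can be sketched directly: the closed-form slope and intercept that minimize the squared error match what `lm()` returns. The toy data below are made up for illustration.

```r
# Toy data (made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form least-squares estimates: slope b1, intercept b0
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() minimizes the same criterion and yields the same coefficients
fit <- lm(y ~ x)
c(b0, b1)
coef(fit)
```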

Load Prestige Data:

library(car)
Prestige
summary(Prestige)
   education          income          women           prestige         census       type   
 Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80   Min.   :1113   bc  :44  
 1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23   1st Qu.:3120   prof:31  
 Median :10.540   Median : 5930   Median :13.600   Median :43.60   Median :5135   wc  :23  
 Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83   Mean   :5402   NA's: 4  
 3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27   3rd Qu.:8312            
 Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20   Max.   :9517            
pairs(Prestige, main="Prestige Data")

plot.ts(Prestige, main="Prestige Data")

The correlation matrix shows that education and income are strongly related to prestige.

d <- data.frame(education = Prestige$education,
                income    = Prestige$income,
                women     = Prestige$women,
                prestige  = Prestige$prestige,
                census    = Prestige$census)
m <- cor(d)
library(corrplot)
corrplot(m, method="circle")

Education and income have a strong influence on prestige.

                                  Example 1 (Prestige).

Prepare the Prestige data: build an index that marks every fourth row, i.e. 1/4 of the data, as the test set; the remaining 3/4 are used for training.

testidx_pre <- which(1:nrow(Prestige)%%4==0)
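The modulo split above deterministically takes every fourth row, which can pick up ordering effects in the data. A random split is a common alternative; the sketch below is not from the original, and the variable names and seed are my own choices.

```r
# Hypothetical alternative to the modulo split: a random 25% test set.
library(car)                          # provides the Prestige data
set.seed(42)                          # makes the split reproducible
n <- nrow(Prestige)
testidx_rnd <- sample(n, size = floor(n / 4))
train_rnd <- Prestige[-testidx_rnd, ]
test_rnd  <- Prestige[testidx_rnd, ]
```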

Split data:

prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]

Build a linear model (target: prestige):

model <- lm(prestige~., data=prestige_train)

Use the model to predict the test data:

prediciton <- predict(model, newdata=prestige_test)

Check how well the predictions correlate with the actual test values; a correlation close to 1 indicates the model generalizes well to unseen data.

cor(prediciton, prestige_test$prestige)
[1] 0.9376719
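Correlation tells you the predictions move with the actual values, but not how large the errors are. As an added check (not in the original), RMSE under the same split gives the typical error in prestige points:

```r
# Rebuild Example 1's split and model, then compute test-set RMSE.
library(car)
testidx_pre <- which(1:nrow(Prestige) %% 4 == 0)
prestige_train <- Prestige[-testidx_pre, ]
prestige_test  <- Prestige[testidx_pre, ]
model <- lm(prestige ~ ., data = prestige_train)
pred  <- predict(model, newdata = prestige_test)
# RMSE: typical prediction error in the units of the target
rmse <- sqrt(mean((pred - prestige_test$prestige)^2))
rmse
```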

Show result:

summary(model)

Call:
lm(formula = prestige ~ ., data = prestige_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.9079  -5.0336   0.3159   5.3831  17.8852 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.071e+01  1.142e+01  -1.813  0.07437 .  
education    4.201e+00  8.291e-01   5.067 3.49e-06 ***
income       1.150e-03  3.511e-04   3.277  0.00168 ** 
women        3.630e-02  4.006e-02   0.906  0.36817    
census       1.865e-03  9.913e-04   1.881  0.06442 .  
typeprof     1.131e+01  7.393e+00   1.530  0.13075    
typewc       1.987e+00  4.958e+00   0.401  0.68984    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.416 on 66 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8204,    Adjusted R-squared:  0.8041 
F-statistic: 50.26 on 6 and 66 DF,  p-value: < 2.2e-16
curve(-2.071e+01 + 4.201e+00*x, from=6, to=16, xlab="education", ylab="predicted prestige")

plot(model)

prediciton
  purchasing.officers            architects           draughtsmen        social.workers             ministers            physicians 
             51.30093              75.85178              54.71590              63.92539              61.74322              92.95148 
        nursing.aides    commercial.artists               typists       shipping.clerks         postal.clerks         travel.clerks 
             31.61065              51.26625              44.32374              33.43635              37.56363              45.73771 
         sales.clerks  real.estate.salesmen                 cooks            launderers          farm.workers        slaughterers.2 
             38.52429              46.44319              28.69344              27.54240              31.72468              33.23791 
      tool.die.makers          auto.workers sewing.mach.operators    electrical.linemen                masons                pilots 
             46.48374              37.75900              28.62746              43.20600              30.26737              75.30271 
         longshoremen 
             37.28700 
                                  Example 2 (Census).

Prepare the Prestige data (census target): build an index that marks every fourth row, i.e. 1/4 of the data, as the test set.

testidx_pre <- which(1:nrow(Prestige)%%4==0)

Split data:

prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]

Build a linear model (target: census):

model <- lm(census~., data=prestige_train)

Use the model to predict the test data:

prediciton <- predict(model, newdata=prestige_test)

Check how well the predictions correlate with the actual test values; a correlation close to 1 indicates the model generalizes well to unseen data.

cor(prediciton, prestige_test$census)
[1] 0.7521363

Show result:

summary(model)

Call:
lm(formula = census ~ ., data = prestige_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-2314.6  -280.6   137.6   555.8  1660.9 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.120e+03  8.624e+02  10.576 7.61e-16 ***
education   -2.249e+02  1.149e+02  -1.957   0.0546 .  
income      -2.631e-02  4.568e-02  -0.576   0.5666    
women       -1.740e-01  4.876e+00  -0.036   0.9716    
prestige     2.728e+01  1.451e+01   1.881   0.0644 .  
typeprof    -5.135e+03  6.547e+02  -7.844 5.04e-11 ***
typewc      -3.377e+03  4.334e+02  -7.791 6.26e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 897.1 on 66 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8996,    Adjusted R-squared:  0.8905 
F-statistic:  98.6 on 6 and 66 DF,  p-value: < 2.2e-16
curve(9.120e+03 - 2.249e+02*x, from=6, to=16, xlab="education", ylab="predicted census")

plot(model)

prediciton
  purchasing.officers            architects           draughtsmen        social.workers             ministers            physicians 
             2731.290              2270.106              2668.436              2116.075              2585.987              2106.844 
        nursing.aides    commercial.artists               typists       shipping.clerks         postal.clerks         travel.clerks 
             7842.070              2884.537              4203.111              4397.013              4386.272              3975.482 
         sales.clerks  real.estate.salesmen                 cooks            launderers          farm.workers        slaughterers.2 
             4126.262              4346.245              8098.660              7948.086              7724.172              8213.242 
      tool.die.makers          auto.workers sewing.mach.operators    electrical.linemen                masons                pilots 
             7798.560              8048.416              8363.928              7981.632              8466.525              2659.409 
         longshoremen 
             7824.743 
                                  Example 3 (Prestige).

Prepare the Prestige data: build an index that marks every tenth row, i.e. 1/10 of the data, as the test set.

testidx_pre <- which(1:nrow(Prestige)%%10==0)

Split data:

prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]

Build a linear model (target: prestige):

model <- lm(prestige~., data=prestige_train)

Use the model to predict the test data:

prediciton <- predict(model, newdata=prestige_test)

Check how well the predictions correlate with the actual test values; a correlation close to 1 indicates the model generalizes well to unseen data.

cor(prediciton, prestige_test$prestige)
[1] 0.9274643

Show result:

summary(model)

Call:
lm(formula = prestige ~ ., data = prestige_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.060  -5.405   1.202   5.051  19.119 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.295e+01  8.455e+00  -1.532  0.12948    
education    3.957e+00  6.907e-01   5.729 1.67e-07 ***
income       1.012e-03  2.775e-04   3.645  0.00047 ***
women        1.450e-02  3.244e-02   0.447  0.65612    
census       1.192e-03  6.420e-04   1.857  0.06695 .  
typeprof     1.076e+01  4.815e+00   2.235  0.02819 *  
typewc       6.187e-01  3.355e+00   0.184  0.85418    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.107 on 81 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8394,    Adjusted R-squared:  0.8275 
F-statistic: 70.56 on 6 and 81 DF,  p-value: < 2.2e-16
curve(-1.295e+01 + 3.957e+00*x, from=6, to=16, xlab="education", ylab="predicted prestige")

plot(model)

prediciton
    mining.engineers            ministers          pharmacists      shipping.clerks    sales.supervisors                cooks               bakers 
            69.47141             62.97919             72.66327             33.88479             40.53581             28.87879             31.40611 
        auto.workers construction.foremen         longshoremen 
            36.63099             39.11454             36.07945 
                                  Example 4 (Census).

Prepare the Prestige data (census target): build an index that marks every tenth row, i.e. 1/10 of the data, as the test set.

testidx_pre <- which(1:nrow(Prestige)%%10==0)

Split data:

prestige_train <- Prestige[-testidx_pre,]
prestige_test <- Prestige[testidx_pre,]

Build a linear model (target: census):

model <- lm(census~., data=prestige_train)

Use the model to predict the test data:

prediciton <- predict(model, newdata=prestige_test)

Check how well the predictions correlate with the actual test values; a correlation close to 1 indicates the model generalizes well to unseen data.

cor(prediciton, prestige_test$census)
[1] 0.9451727

Show result:

summary(model)

Call:
lm(formula = census ~ ., data = prestige_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4040.9  -384.0   141.7   602.9  5759.6 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.975e+03  9.407e+02  10.604  < 2e-16 ***
education   -3.830e+02  1.321e+02  -2.899  0.00481 ** 
income       7.575e-03  5.075e-02   0.149  0.88171    
women       -5.273e+00  5.474e+00  -0.963  0.33824    
prestige     3.425e+01  1.844e+01   1.857  0.06695 .  
typeprof    -4.292e+03  6.927e+02  -6.195 2.29e-08 ***
typewc      -2.592e+03  4.905e+02  -5.285 1.04e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1205 on 81 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8129,    Adjusted R-squared:  0.7991 
F-statistic: 58.67 on 6 and 81 DF,  p-value: < 2.2e-16
curve(9.975e+03 - 3.830e+02*x, from=6, to=16, xlab="education", ylab="predicted census")

plot(model)

prediciton
    mining.engineers            ministers          pharmacists      shipping.clerks    sales.supervisors                cooks               bakers 
            2511.492             2637.236             2180.507             4904.864             5002.044             7777.201             8275.712 
        auto.workers construction.foremen         longshoremen 
            7948.091             8633.108             7699.212 

Predictors not marked with at least one * are not statistically significant at the 0.05 level and can usually be dropped from the model, though this should be verified by refitting.
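As a sketch of that rule, the insignificant `women` predictor from Example 1's summary can be dropped with `update()`; whether the drop is harmless should still be checked against the adjusted R-squared of the full model.

```r
# Refit Example 1's model without the insignificant 'women' predictor.
library(car)
testidx_pre <- which(1:nrow(Prestige) %% 4 == 0)
prestige_train <- Prestige[-testidx_pre, ]
model  <- lm(prestige ~ ., data = prestige_train)
model2 <- update(model, . ~ . - women)   # drop one term, keep the rest
summary(model2)   # compare adjusted R-squared against the full model
```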

END

The goal of minimizing the squared error makes linear regression very sensitive to outliers whose output values deviate greatly from the rest.
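This sensitivity is easy to demonstrate on synthetic data (made up for illustration): corrupting a single observation noticeably shifts the OLS slope, because squared error weights large residuals heavily.

```r
# One extreme outlier can swing an OLS fit noticeably.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 1)
fit_clean <- lm(y ~ x)

y_out <- y
y_out[20] <- y_out[20] + 100    # corrupt one observation
fit_out <- lm(y_out ~ x)

coef(fit_clean)
coef(fit_out)    # slope and intercept shift because of a single point
```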
